[dev] split fused moe experts to ensure quantization #2464
liwei109 wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: Li Wei <liwei.109@outlook.com>
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this label is required to run the full testing suite; please add it only once the PR is code complete and local testing has been performed.
Summary of Changes: This pull request enhances model-free post-training quantization (PTQ) by addressing a specific challenge with Mixture-of-Experts (MoE) models that have fused expert weights. By introducing a mechanism to split these fused weights into their individual components, the PR ensures that such models can be accurately and effectively quantized, expanding the range of models supported by the quantization framework.
Code Review
This pull request introduces a mechanism to split fused MoE experts, which is a necessary step for quantizing models like Qwen3.5. The approach is sound, but I've found a few critical bugs in the implementation of split_fused_moe_experts. Specifically, the logic for generating new tensor names is incorrect and will lead to invalid keys. Additionally, there's a case where a tensor could be inadvertently dropped. I've also included some suggestions to improve code clarity and logging practices. Please address the identified bugs to ensure the feature works as expected.
```python
print(f"Warning: gate_up_proj {name} has odd second dimension: {tensor.shape}")
continue
```
There are two issues here:

1. When a `gate_up_proj` tensor has an odd second dimension, the code `continue`s without adding the original tensor to the `_tensors` dictionary. This will cause the tensor to be dropped from the model, which is a significant bug.
2. The use of `print` for logging is not ideal for a library. Please use a proper logger, like `loguru`, which is used elsewhere in the project. This would also require importing `logger` at the top of the file.

The `print` statements on lines 240 and 252 should also be converted to `logger.info` for consistency.
```diff
- print(f"Warning: gate_up_proj {name} has odd second dimension: {tensor.shape}")
- continue
+ logger.warning(f"Warning: gate_up_proj {name} has odd second dimension: {tensor.shape}, skipping split")
+ _tensors[name] = tensor
+ continue
```
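The fallback the suggestion describes can be sketched as follows. This is an illustrative example only, not the PR's actual code; `numpy` stands in for `torch`, and the helper name and key are hypothetical:

```python
import numpy as np


def keep_if_unsplittable(name, tensor, out):
    """Hypothetical helper mirroring the suggested fix: if the second
    dimension is odd, the tensor cannot be split evenly into gate/up
    halves, so keep the original tensor instead of dropping it."""
    if tensor.shape[1] % 2 != 0:
        out[name] = tensor  # preserve the fused tensor rather than losing it
        return False
    return True


tensors = {}
odd = np.zeros((4, 5, 8))  # second dim is odd -> cannot be split in half
ok = keep_if_unsplittable("model.layers.0.mlp.experts.gate_up_proj.weight", odd, tensors)
```

With this fallback, an unsplittable tensor survives in the output dictionary instead of silently disappearing from the model.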
```python
base_key = name.replace("mlp.experts.gate_up_proj", f"mlp.experts.{expert_idx}")
_tensors[base_key + ".gate_proj.weight"] = gate_proj
_tensors[base_key + ".up_proj.weight"] = up_proj
```
The logic for generating new tensor names for the split `gate_proj` and `up_proj` experts is incorrect. It constructs the new key by appending to the original tensor name, which already includes `.weight`, resulting in an invalid key like `...weight.gate_proj.weight`. You should remove the `.weight` suffix from the original name before creating the new keys.
```diff
- base_key = name.replace("mlp.experts.gate_up_proj", f"mlp.experts.{expert_idx}")
+ base_key = name.rsplit(".weight", 1)[0].replace("mlp.experts.gate_up_proj", f"mlp.experts.{expert_idx}")
  _tensors[base_key + ".gate_proj.weight"] = gate_proj
  _tensors[base_key + ".up_proj.weight"] = up_proj
```
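The key bug can be seen by tracing the string manipulation directly. A small sketch (the layer path is a plausible example, not taken from a specific model):

```python
# Trace the buggy vs. fixed key construction for an example tensor name.
name = "model.layers.0.mlp.experts.gate_up_proj.weight"
expert_idx = 3

# Buggy version: appends to a name that still ends in ".weight",
# producing "...experts.3.weight.gate_proj.weight".
buggy = name.replace(
    "mlp.experts.gate_up_proj", f"mlp.experts.{expert_idx}"
) + ".gate_proj.weight"

# Fixed version: strip the trailing ".weight" first, then substitute
# the expert index, producing "...experts.3.gate_proj.weight".
base_key = name.rsplit(".weight", 1)[0].replace(
    "mlp.experts.gate_up_proj", f"mlp.experts.{expert_idx}"
)
fixed = base_key + ".gate_proj.weight"
```

`rsplit(".weight", 1)` removes only the final `.weight` suffix, so any earlier occurrence of the substring in the path is untouched.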
```python
new_key = name.replace("mlp.experts.down_proj", f"mlp.experts.{expert_idx}") + ".down_proj.weight"
_tensors[new_key] = down_proj
```
Similar to the `gate_up_proj` case, the key generation for `down_proj` is incorrect. It appends `.down_proj.weight` to a name that already ends with `.weight`. Please correct this by first removing the original `.weight` suffix.
```diff
- new_key = name.replace("mlp.experts.down_proj", f"mlp.experts.{expert_idx}") + ".down_proj.weight"
+ new_key = name.rsplit(".weight", 1)[0].replace("mlp.experts.down_proj", f"mlp.experts.{expert_idx}") + ".down_proj.weight"
  _tensors[new_key] = down_proj
```
```python
print(f"Warning: gate_up_proj {name} has odd second dimension: {tensor.shape}")
continue

hidden_size = tensor.shape[1] // 2
```
The variable `hidden_size` is confusing. Based on the comment on line 220 (`gate_up_proj` is typically `[num_experts, 2*intermediate, hidden]`), `tensor.shape[1]` is `2 * intermediate_size`. Therefore, this variable actually holds the intermediate size. Renaming it would improve clarity. You will also need to update its usages on lines 232 and 233.
```diff
- hidden_size = tensor.shape[1] // 2
+ intermediate_size = tensor.shape[1] // 2
```
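The naming point can be verified with a toy tensor of the stated layout. The sizes below are made up for illustration:

```python
import numpy as np

# Fused gate_up_proj layout per the review: [num_experts, 2 * intermediate, hidden].
# shape[1] // 2 therefore recovers the intermediate size, not the hidden size.
num_experts, intermediate, hidden = 4, 6, 8
tensor = np.zeros((num_experts, 2 * intermediate, hidden))

intermediate_size = tensor.shape[1] // 2  # equals `intermediate`, not `hidden`
```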
```python
expert_tensor = tensor[expert_idx]  # [2*hidden, intermediate]

# Split gate and up projections
gate_proj = expert_tensor[:hidden_size, :]
```
Might be a little more readable to just do the full indexing at the same time instead of doing it piecemeal; these are all views, so performance will be equivalent.
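The equivalence the reviewer mentions can be checked with a toy array. A sketch using `numpy` in place of `torch` (basic slicing produces views in both; the shapes below are made up):

```python
import numpy as np

tensor = np.arange(4 * 12 * 8).reshape(4, 12, 8)  # [num_experts, 2*intermediate, hidden]
intermediate_size = tensor.shape[1] // 2
expert_idx = 2

# Piecemeal: first select the expert, then slice the gate half.
expert_tensor = tensor[expert_idx]
gate_piecemeal = expert_tensor[:intermediate_size, :]

# Full indexing in one expression: same result, one line.
gate_full = tensor[expert_idx, :intermediate_size, :]
```

Since both forms are views over the same buffer, the choice is purely about readability.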
```python
    Returns:
        New dictionary with split expert weights
    """
    _tensors = {}
```
Rename as `fused_tensors`.
HDCharles left a comment:
See comments and bot comments. Did you run the test successfully? It looks like the `x.weight` name bug should have produced an error, no?
The quality checks have failed. Please run the quality checks locally before adding the ready label.
SUMMARY:
This PR adds a `split_fused_moe_experts` function for model-free quantization. This ensures that models with fused MoE layers (e.g., Qwen3.5 and Qwen3-VL) containing fused `gate_up_proj` and `down_proj` weights can be effectively quantized.

TEST PLAN:
We add an example of Qwen3.5 for testing purposes.
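Putting the review feedback together, a minimal sketch of what such a function could look like with the fixes applied (corrected `.weight` key handling, fallback that keeps unsplittable tensors, single-step indexing). This is an illustrative reconstruction, not the PR's actual implementation; `numpy` stands in for `torch`, and the example layer paths are assumptions:

```python
import numpy as np


def split_fused_moe_experts(tensors):
    """Split fused MoE expert weights into per-expert gate/up/down tensors."""
    out = {}
    for name, tensor in tensors.items():
        if "mlp.experts.gate_up_proj" in name:
            if tensor.shape[1] % 2 != 0:
                out[name] = tensor  # keep the fused tensor instead of dropping it
                continue
            intermediate_size = tensor.shape[1] // 2
            for expert_idx in range(tensor.shape[0]):
                # Strip ".weight" before substituting the expert index.
                base_key = name.rsplit(".weight", 1)[0].replace(
                    "mlp.experts.gate_up_proj", f"mlp.experts.{expert_idx}"
                )
                out[base_key + ".gate_proj.weight"] = tensor[expert_idx, :intermediate_size, :]
                out[base_key + ".up_proj.weight"] = tensor[expert_idx, intermediate_size:, :]
        elif "mlp.experts.down_proj" in name:
            for expert_idx in range(tensor.shape[0]):
                new_key = name.rsplit(".weight", 1)[0].replace(
                    "mlp.experts.down_proj", f"mlp.experts.{expert_idx}"
                ) + ".down_proj.weight"
                out[new_key] = tensor[expert_idx]
        else:
            out[name] = tensor  # pass non-fused tensors through unchanged
    return out


fused = {
    "model.layers.0.mlp.experts.gate_up_proj.weight": np.zeros((2, 12, 8)),
    "model.layers.0.mlp.experts.down_proj.weight": np.zeros((2, 8, 6)),
}
split = split_fused_moe_experts(fused)
```

With two experts, the two fused tensors expand into six per-expert keys (gate, up, and down per expert), each addressable individually by the quantization pass.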